

CancerSEEK -- New Sequencing - Aneuploidy only & Full Dataset


Short Summary:


Cancer detection through the multi-analyte blood test CancerSEEK

Referring to this original publication and this one, with a new sequencing method applied.


Content

Data Exploration and Cleaning

This time there seems to be an adequate number of Patient IDs, in accordance with what's communicated in the publication: 1695. We can also see that there are 883 patients with cancer, as indicated by the non-null values in the column AJCC Stage.

Furthermore, we can drop columns that are not needed, such as Sample ID # and Patient ID #, or that are too closely related to the target variable, such as AJCC Stage, and replace NaN values with 0 in the columns Aneuploidy and Mutation.
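The cleaning described above could be sketched as follows. The toy DataFrame stands in for the real dataset; the column names are taken from the text.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw dataset; column names follow the publication
df = pd.DataFrame({
    "Sample ID #": ["S1", "S2", "S3"],
    "Patient ID #": ["P1", "P2", "P3"],
    "AJCC Stage": ["II", None, "III"],  # non-null rows are cancer samples
    "Aneuploidy": [0.8, np.nan, 1.2],
    "Mutation": [np.nan, 0.1, 0.5],
})

# Target: cancer if AJCC Stage is present, healthy otherwise
y = df["AJCC Stage"].notna().astype(int)

# Drop identifier columns and the stage column (too close to the target)
X = df.drop(columns=["Sample ID #", "Patient ID #", "AJCC Stage"])

# Replace missing Aneuploidy / Mutation scores with 0
X[["Aneuploidy", "Mutation"]] = X[["Aneuploidy", "Mutation"]].fillna(0)
```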

Display missing values

Some algorithms have problems with missing values. As there are quite a few in the Aneuploidy and Mutation columns, they will be replaced with zeros.

Split into train and test sets
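A minimal sketch of a stratified split, so that both sets keep the cancer/healthy ratio. The data here is synthetic; in the notebook the real feature matrix and labels would be passed in.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 100 samples, slightly imbalanced labels
rng = np.random.default_rng(0)
X_full = rng.normal(size=(100, 3))
y_full = np.array([1] * 52 + [0] * 48)

# Stratify on the label so both splits keep the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X_full, y_full, test_size=0.2, stratify=y_full, random_state=42)
```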

Transformation of train and test sets

Start by creating custom transformers that can do the desired transformation. From the Supportive material related with the publication:

To account for variations in the lower limits of detection across different experiments, we found the 90th percentile feature value in the healthy training samples. We then found any feature value below that threshold and set all values to the 90th percentile threshold. This transformation was done for all training and testing samples. This procedure was done for aneuploidy scores, somatic mutation scores, and protein concentrations.
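The procedure quoted above could be written as a scikit-learn transformer along these lines. The class name `HealthyPercentileFloor` is my own; it assumes `y` is 1 for cancer and 0 for healthy.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class HealthyPercentileFloor(BaseEstimator, TransformerMixin):
    """Floor each feature at the 90th percentile of the healthy
    training samples, per the publication's supplement."""

    def __init__(self, percentile=90):
        self.percentile = percentile

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        healthy = X[np.asarray(y) == 0]
        # Per-feature threshold from healthy training samples only
        self.thresholds_ = np.percentile(healthy, self.percentile, axis=0)
        return self

    def transform(self, X):
        # Raise every value below the threshold up to the threshold
        return np.maximum(np.asarray(X, dtype=float), self.thresholds_)

# Tiny demo: four healthy values [0, 1, 2, 3] -> 90th percentile 2.7
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_demo = np.array([0, 0, 0, 0, 1])
floored = HealthyPercentileFloor().fit(X_demo, y_demo).transform(X_demo)
```

Because the transformer is fit on training data only, it can be dropped into a Pipeline and refit inside each cross-validation fold without leakage.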

Visualise the data

Visualise the data using Boxplots, histograms and correlation plots to get a better understanding of it.

Start by plotting the difference in distribution before and after winsorizing the healthy samples, using histograms. Display the cancerous distribution to the far right.

As displayed above, there are considerable differences in the distributions after applying winsorizing compared with before. The spikes on the far right side of the right-hand plots are quite pronounced, but also reasonable.

Continue by plotting the total count of each Tumor type, including healthy Normal.

Plot the target variable Tumor type against each variable using boxplots, while histograms aid the understanding of each variable's distribution.

The following histograms show the distribution of each variable, together with the test statistic and p-value from a normality test. None of the variables are normally distributed.
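The normality check could look like this, assuming `scipy.stats.normaltest` (D'Agostino-Pearson) is the test used; a skewed synthetic sample stands in for a real feature column.

```python
import numpy as np
from scipy.stats import normaltest

# Synthetic, clearly non-normal feature values (e.g. a protein concentration)
rng = np.random.default_rng(1)
skewed = rng.exponential(size=500)

# Small p-value -> reject the null hypothesis of normality
stat, p = normaltest(skewed)
```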

Plot correlation matrix to display correlation between the target variable and features as well as among features.
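A sketch of computing the correlation matrix with pandas; the feature names here are illustrative stand-ins for the real columns, and in the notebook the matrix would typically be rendered with e.g. `seaborn.heatmap(corr, annot=True)`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in features; "CA-125" is a hypothetical protein column
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "Aneuploidy": rng.normal(size=200),
    "Mutation": rng.normal(size=200),
})
df["CA-125"] = 0.1 * df["Aneuploidy"] + rng.normal(size=200)
df["target"] = (df["Aneuploidy"] + df["Mutation"] > 0).astype(int)

# Pairwise Pearson correlations between all columns, target included
corr = df.corr()
```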

There is only minor collinearity between features, so there is no need to drop any.

Model Evaluation

Pipeline on full feature set - Individual Models

The pipeline includes all steps that are needed to run 10-fold cross-validation and will take the cleaned data through data transformation, model tuning and model evaluation. For visualisation purposes, the transformation steps are covered earlier in the notebook, but in order to run unbiased cross-validation those steps have to be executed within each fold during cross-validation.

In order to achieve this, a custom transformer that replaces all protein, omega and mutation values lower than the 90th percentile of the healthy samples with that 90th percentile value was created earlier.

Select the numerical columns to feed into the numerical pipeline. Of course, in this case there are only numerical columns to feed the pipeline with.
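The fold-safe setup described above could be sketched like this. `StandardScaler` stands in for the custom percentile-flooring transformer to keep the example self-contained, and a synthetic dataset replaces the real one; the point is that `cross_val_score` refits the whole Pipeline inside every fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: label driven by the first feature plus noise
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# The percentile-flooring transformer from earlier would be the first step;
# StandardScaler stands in here so the sketch runs on its own
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The Pipeline is refit per fold, so no information leaks across folds
scores = cross_val_score(
    pipe, X, y, cv=StratifiedKFold(n_splits=10), scoring="accuracy")
```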

Plot some statistics on the above models. For some reason, the SHAP and feature importance plots need to be plotted separately; the output is inconsistent otherwise.

Plot SHAP values for supported classifiers.

Pipeline with Combined Models

VotingClassifier

Try a VotingClassifier with the above estimators. Use voting="soft", as that often works better for well-calibrated models, and give more weight to the more performant models.
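A sketch of the weighted soft-voting setup, assuming synthetic data; scikit-learn's GradientBoostingClassifier stands in for CatBoost/XGBoost so the example has no extra dependencies, but the weighting pattern is the same.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression

# Synthetic data with an easy signal in the first feature
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",      # average predicted probabilities, not hard votes
    weights=[1, 2, 2],  # more weight to the stronger estimators
)
vote.fit(X, y)
```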

A VotingClassifier, with higher weights given to the most performant individual estimators, CatBoost and XGBoost, improves specificity slightly, to more than 99%. Only 8 out of the 812 healthy samples are incorrectly classified.

StackingClassifier

Run several experiments by stacking previously trained models to combine them into a larger and, hopefully, more powerful model. Start by using Logistic Regression as the final estimator in the StackingClassifier.

Take 1

Take 2

Use Random Forest as final estimator. Set n_estimators=300 and max_depth=4 based on generally good performance in earlier experiments.
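This take could be sketched as follows, with the Random Forest hyper parameters from the text; the base estimators and data are illustrative stand-ins for the ones actually used in the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: label driven by the first two features
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
    ],
    # Final estimator with the hyper parameters mentioned in the text
    final_estimator=RandomForestClassifier(
        n_estimators=300, max_depth=4, random_state=0),
    cv=5,  # base estimators produce out-of-fold predictions for the meta model
)
stack.fit(X, y)
```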

Take 3

Use XGBoost as final estimator and train it on both the predictions from the four estimators as well as the original data by setting passthrough = True.
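The `passthrough=True` mechanism could be sketched like this; scikit-learn's GradientBoostingClassifier stands in for XGBoost as the final estimator, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data: label driven by the feature sum
rng = np.random.default_rng(6)
X = rng.normal(size=(150, 4))
y = (X.sum(axis=1) > 0).astype(int)

# passthrough=True feeds the original features to the final estimator
# alongside the base estimators' out-of-fold predictions
stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000))],
    final_estimator=GradientBoostingClassifier(random_state=0),
    passthrough=True,
    cv=5,
)
stack.fit(X, y)
```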

Take 4

Use XGBoost with standard hyper parameters as final estimator in the Stacked Classifier. Using standard hyper parameters decreases the risk of overfitting on this particular dataset and is more likely to perform similarly on an independent dataset.

Take 5

Use CatBoost as final meta estimator.

The most performant StackingClassifier uses CatBoost as the meta estimator, finding more than 98% of the cancer samples. Precision is also high, at above 98%.

Pipeline on Aneuploidy only

Now, train and evaluate various models using only the Aneuploidy feature.

Individual Models

Plot some statistics on the above models. For some reason, the SHAP and feature importance plots need to be plotted separately; the output is inconsistent otherwise.

Plot SHAP values for supported classifiers.

The results for the individual models, and more specifically sensitivity, are considerably lower than with all ten features. At most, some 78% of the cancer samples are correctly classified, with precision hovering around 94%. This is achieved with KNN, Random Forest, CatBoost and LightGBM.

Pipeline with Combined Models - Aneuploidy

VotingClassifier

Try a VotingClassifier with the above estimators. Use voting="soft", as that often works better for well-calibrated models, and give more weight to the more performant models.

Combining several estimators by voting has not improved performance.

StackingClassifier

Run several experiments by stacking previously trained models to combine them into a larger and, hopefully, more powerful model. Start by using Logistic Regression as the final estimator in the StackingClassifier.

Take 1

Take 2

Use Random Forest as final estimator. Set n_estimators=300 and max_depth=4 based on generally good performance in earlier experiments and to prevent tuning the hyper parameters excessively.

Take 3

Use XGBoost as final estimator and train it on both the predictions from the four estimators as well as the original data by setting passthrough = True.

Take 4

Use XGBoost with standard hyper parameters as final estimator in the Stacked Classifier. Using standard hyper parameters decreases the risk of overfitting on this particular dataset and is more likely to perform similarly on an independent dataset.

Take 5

Use CatBoost as final meta estimator.

Combining estimators through stacking has in general decreased performance compared with the single-model alternatives. The only combination that increases performance is when Random Forest is used as the meta estimator: 80% of all cancer samples are correctly classified while maintaining the previous precision of 94%.

Model summary

Final Model - Full Feature Set

The most performant model on the full feature set is a single XGBoost. It correctly classified 98% of all cancer samples with high precision at 99%. Specificity, the fraction of healthy samples correctly classified, was also high at 99%, with a corresponding precision of 98%.

Final Model - Aneuploidy only

Results are considerably lower when only the single Aneuploidy feature is considered during modelling. At most, 80% of the cancer samples are correctly classified, with a precision of 94%, when Random Forest is used as meta estimator on the KNN, Gradient Boosting, CatBoost, LightGBM and XGBoost predictions.

Conclusions

Two different sets of models have been optimized.

By using the entire 10-feature set, the results are considerably better than when only the single Aneuploidy feature is used. The full-feature model correctly classifies 98% of the cancer samples (compared with 80% for Aneuploidy only) with 99% precision (94%). Specificity, the fraction of correctly classified healthy samples, is 99% with a corresponding precision of 98%. It should be noted that the VotingClassifier also performed well, achieving slightly higher specificity at above 99%.

The results on the entire feature set are promising, giving a confident indication of whether a patient has cancer or not.
